Importing Required Packages

Reading the Data

Notes:

Initial Sanity Checks and Data Cleaning

Notes:

The integer variables are inherently categorical, each containing only two classes. To analyze them better during EDA, I chose to convert all of them, except the target variable, to categorical variables.
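The conversion described above could be sketched as follows. The frame and the target name `y` are stand-ins here, since the real column names aren't shown:

```python
import pandas as pd

# Toy frame standing in for the real data; "y" is the (assumed) target name.
df = pd.DataFrame({
    "x1": [0, 1, 1, 0],          # binary integer -> categorical
    "x2": [1, 1, 0, 1],          # binary integer -> categorical
    "x3": [3.2, 1.5, 0.7, 2.1],  # genuinely numeric, left alone
    "y":  [0, 0, 1, 0],          # target stays integer
})

# Integer columns with exactly two distinct values, excluding the target,
# are recast as pandas categoricals so EDA treats them as classes.
binary_ints = [c for c in df.select_dtypes(include="integer").columns
               if c != "y" and df[c].nunique() == 2]
df[binary_ints] = df[binary_ints].astype("category")

print(df.dtypes.to_dict())
```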

Notes:

Two columns in the data are, in fact, numerical variables, but, due to the presence of some special characters (% and $), Python has treated them as object type. The steps above fix the issue.
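A minimal sketch of that cleanup, using hypothetical column names since the real ones aren't shown:

```python
import pandas as pd

# Toy columns mimicking the issue: numbers stored as strings because of "$"/"%".
df = pd.DataFrame({
    "x_money":   ["$1,200.50", "$85.00", None],
    "x_percent": ["12.5%", "3.1%", "7.0%"],
})

# Strip the special characters (plus thousands separators), then cast to float;
# missing values pass through as NaN.
for col in ["x_money", "x_percent"]:
    df[col] = (df[col]
               .str.replace(r"[$,%]", "", regex=True)
               .astype(float))

print(df.dtypes.to_dict())
```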

Notes:

Here, I dropped all features containing only one class (one unique value), counting missing values as a separate class, as such features carry no value for classification.
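The rule above, with NaN counted as its own class, can be expressed with `nunique(dropna=False)`; the column names here are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "constant":      ["a", "a", "a", "a"],     # one class -> drop
    "const_with_na": ["b", "b", np.nan, "b"],  # two "classes" (b, NaN) -> keep
    "useful":        ["x", "y", "x", "y"],     # keep
})

# nunique(dropna=False) counts NaN as a separate class, matching the rule above.
single_class = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df = df.drop(columns=single_class)

print(single_class, list(df.columns))
```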

Notes:

Notes:

Notes:

Missing value treatment for categorical variables done successfully!

Notes:

Looking at the head and bottom of the dataset showed that days of the week are spelled in two different ways: abbreviated and complete. Here, I've merged the two ways.
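One way to merge the two spellings is a simple mapping from the full names down to the abbreviations (the exact spellings in the data are assumed here):

```python
import pandas as pd

s = pd.Series(["Mon", "Monday", "Tue", "Tuesday", "Fri", "Friday"])

# Map full day names to their three-letter abbreviations; values that are
# already abbreviated are left untouched by Series.replace.
full_to_abbr = {"Monday": "Mon", "Tuesday": "Tue", "Wednesday": "Wed",
                "Thursday": "Thu", "Friday": "Fri", "Saturday": "Sat",
                "Sunday": "Sun"}
s = s.replace(full_to_abbr)

print(s.unique())
```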

Notes:

EDA: Univariate Analysis

Numerical Variables

Observations:

Categorical Variables

Target Variable

Observations:

As can be seen, the overwhelming majority of records belong to target class zero. This class imbalance thus requires class weighting or resampling in order to build more accurate classification models.
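The imbalance can be quantified directly with normalized value counts (the 90/10 split below is illustrative, not the real ratio):

```python
import pandas as pd

# Synthetic target mimicking a heavily imbalanced binary label.
y = pd.Series([0] * 90 + [1] * 10, name="y")

# Normalized value counts give the share of each class.
proportions = y.value_counts(normalize=True)
print(proportions.to_dict())
```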

Categorical Variables With Three or Fewer Classes

Observations:

Day of the Week

Observations:

The early weekdays are present in the data more frequently than the other days.

State

Observations:

Month of the Year

Observations:

There exist far more records for some months (January, July, August, and December) than for the others.

Insurance

Observations:

The top insurers in the data are Progressive and Allstate, respectively.

Vehicle Type

Observations:

EDA: Bivariate Analysis

Numerical Variables

Observations:

Correlation Among Numerical Variables

Observations:

Categorical Variables

Categorical Variables With Three or Fewer Classes

Observations:

Among categorical features with three or fewer classes, only x31 and x93 (and, slightly, x98) appear to be useful in distinguishing between the target classes. For the rest, the frequency distributions w.r.t. the target classes don't seem to differ meaningfully.
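A row-normalized crosstab is one way to make such comparisons; the data below is a toy stand-in for x31 against the target:

```python
import pandas as pd

df = pd.DataFrame({
    "x31": ["yes", "no", "yes", "no", "yes", "no"],
    "y":   [1, 0, 1, 0, 0, 0],
})

# normalize="index": for each class of x31, the share of each target class.
# Rows with clearly different profiles suggest the feature separates the target.
ct = pd.crosstab(df["x31"], df["y"], normalize="index")
print(ct)
```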

Day of the Week

Observations:

The fractions of the target classes somewhat vary with the day of the week.

State

Observations:

Month of the Year

Observations:

The frequency distribution of the target variable also varies, to some degree, with the month.

Insurance

Vehicle Type

Observations:

The insurer and vehicle type also appear to be slightly helpful in separating the target classes.

Data Preparation for Modeling

Notes:

Notes:

I ensured that the target classes are represented in the same proportion in both the training and validation sets, making the evaluation of the models on unseen data more reliable.
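With scikit-learn, this is the `stratify` argument of `train_test_split`; the data and split size below are illustrative:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared frame.
df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

# stratify=df["y"] preserves the 90/10 class ratio in both splits.
X_train, X_val, y_train, y_val = train_test_split(
    df[["x"]], df["y"], test_size=0.3, stratify=df["y"], random_state=42)

print(y_train.mean(), y_val.mean())
```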

Missing Value Treatment

Notes:

Here, I ensured that only categorical variables with two classes are included for missing value treatment; numerically encoding those with more than two classes would impose an artificial ordering on them.

Notes:

I ensured that all variables included for missing value treatment have been rendered numerical, as this is essential for the imputer to work properly.
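The two steps together could look like the sketch below. The column names are hypothetical, and most-frequent imputation is an assumption, since the notebook's actual imputer isn't shown:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "x93": ["yes", "no", np.nan, "yes"],
    "x98": ["A", np.nan, "B", "A"],
})

# Map each two-class categorical to {0, 1} so the imputer sees numbers only.
codes = {}
for col in df.columns:
    classes = sorted(df[col].dropna().unique())
    codes[col] = {cls: i for i, cls in enumerate(classes)}
    df[col] = df[col].map(codes[col])

# Most-frequent imputation (an assumption; any numeric imputer would now work).
imputer = SimpleImputer(strategy="most_frequent")
df[:] = imputer.fit_transform(df)

print(df)
```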

Notes:

Observations:

Post-Imputation EDA

Observations:

Feature Engineering

Notes:

Quantification of All Features

Notes:

Notes:

Here, we address the issue of imbalanced data using undersampling, which I found to be more effective, yielding better-performing models than oversampling. The comparison isn't shown for the sake of brevity.
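A plain-pandas sketch of random undersampling, drawing the majority class down to the minority count (a 1:1 target ratio is assumed here):

```python
import pandas as pd

df = pd.DataFrame({"x": range(100), "y": [0] * 90 + [1] * 10})

# Sample each class down to the size of the smallest one.
n_minority = df["y"].value_counts().min()
balanced = pd.concat(
    g.sample(n=n_minority, random_state=42) for _, g in df.groupby("y")
)

print(balanced["y"].value_counts().to_dict())
```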

Building the Classification Models

Logistic Regression
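A baseline logistic regression on synthetic stand-in data; `class_weight="balanced"` is shown as one way to counter the imbalance noted earlier, not necessarily the setting used here:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the prepared training set.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" reweights classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_tr, y_tr)
score = f1_score(y_val, clf.predict(X_val))
print(round(score, 3))
```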

XGBoost Classifier

Observations:

Model Tuning

Logistic Regression
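Tuning could be done with a cross-validated grid search as sketched below; the grid, scoring metric, and data are illustrative, since the actual search used isn't shown:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

# Small grid over the regularization strength C, scored by F1 with 5-fold CV.
grid = GridSearchCV(
    LogisticRegression(class_weight="balanced", max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    scoring="f1",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```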

XGBoost Classifier

Observations:

Generating Predictions for Test Data

Loading the Test Set

Cleaning the Test Set and Preparing It for Modeling

Notes:

Although not shown, I conducted a comprehensive EDA on the test set. To avoid making the code lengthy, it's not included in this submitted version.

Making Predictions Using Best Models and Saving the Results
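The final step could be sketched as below; the model, data, and output file name are all stand-ins for the tuned models and real test set:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic train and test sets standing in for the prepared data.
X, y = make_classification(n_samples=200, random_state=1)
X_test, _ = make_classification(n_samples=50, random_state=2)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Collect predictions in a frame and write them out (file name is assumed).
preds = pd.DataFrame({"prediction": model.predict(X_test)})
preds.to_csv("test_predictions.csv", index=False)
print(len(preds))
```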